Introducing a new, hierarchical amdgpu_family_matrix layout by HereThereBeDragons · Pull Request #3248 · ROCm/TheRock

HereThereBeDragons · 2026-02-04T11:10:28Z

This PR is part of enabling CI weekly (big picture PR #1732 ) . For this, refactoring of amdgpu_family is needed for easier selection of a specific gpu, and not having to rely on knowing in which group the specific gpu is part of (presubmit/postsubmit/nightlies).

New layout puts all gpus in a single dictionary amdgpu_family_info_matrix_all. Choices for pre/postsubmit and nightlies are now done via a list.

amdgpu_family_info_matrix_all has more depth in hierarchy to better define which parameters belong to which step: build, test, release.

Tests can be now turned off with a single bool flag. no need to comment out the runner
Default runner labels are "test", "benchmark" and "test-runs-on-multi-gpu". Further labels can be introduced, e.g. "oem" which in the future can overwrite the default runners.

This new layout is introduced in parallel to the old one. The old one stays unchanged to allow gradual move for the CI to use the new layout.

HereThereBeDragons · 2026-02-04T11:12:46Z

To review this you need to compare it with: build_tools/github_actions/amdgpu_family_matrix.py
(did another check of that it matches today, 4ht Feb)

I added some more reviewers just for awareness of these changes.

…or multi gpu runners.

…ime.

jayhawk-commits

Going through the new matrix format and I am liking many of the changes.

I like adding new boolean fields for control of things like test enablement, so we don't have to delete runner group names temporarily for disabling tests.

Clear separation of definitions between build, test, and release is great.

The blueprint provided is a good template, as the matrix is growing in complexity and prone to error from manual input. Will you be adding pytests verifying the contents of the matrix? I'd feel more comfortable with this PR landing if we had the tests.

Another improvement that does not need to be in this pull request is some workflow and script combination that converts what is in the matrix into some .md file that has easier-to-read tables on what is being built and tested. Then, that workflow would run whenever the matrix gets changed to regenerate the .md file.

jayhawk-commits · 2026-02-04T20:23:12Z

What will become of the load_test_runner_from_gh_variables and get_all_families_for_trigger_types functions from the original matrix?

geomin12

overall looks good! just a few housekeeping comments

geomin12 · 2026-02-04T23:45:04Z

+Cmake targets are defined in: cmake/therock_amdgpu_targets.cmake
+"""
+
+# blueprint = {


probably redundant as Data layout comment covers this

i left it as easy copy and paste if you need to add new archs. could maybe put it at the end of the file and add a note to the front?

IMO it's easier to copy/paste from an existing configuration. If the configurations were split between multiple files then having a clean reference/blueprint here would be more valuable.

(So I would suggest deleting this comment)

geomin12 · 2026-02-04T23:48:34Z

+            },
+        }
+    },
+    "gfx115x": {


there is some logic to make these keys lowercase (less mistake prone when declaring gpu families). should we make all these keys consistent and lowercase?

I'd go the other way and use uppercase consistently, to match the source of truth in https://github.com/ROCm/TheRock/blob/main/cmake/therock_amdgpu_targets.cmake

geomin12 · 2026-02-04T23:50:27Z

+}
+
+
+amdgpu_family_info_matrix_all = {


super nit and this is just an opinion (don't need to implement): can we order these to match amdgpu_family_predefined_groups's order?

it started like this in the beginning... if moving i would rather move it into numeric ordering? what do you think?

HereThereBeDragons · 2026-02-05T13:39:37Z

What will become of the load_test_runner_from_gh_variables and get_all_families_for_trigger_types functions from the original matrix?

load_test_runner_from_gh_variables will be in the generator file e.g. look in get_github_event_args() in

TheRock/build_tools/github_actions/configure_amdgpu_matrix.py

Lines 470 to 474 in 89c5e71

    
           if os.environ.get("LOAD_TEST_RUNNERS_FROM_VAR", "false") == "true": 
        
               test_runner_json_str = os.getenv("ROCM_THEROCK_TEST_RUNNERS", "{}") 
        
               github_event_args["orgwide_test_runner_dict"] = json.loads(test_runner_json_str) 
        
           else: 
        
               github_event_args["orgwide_test_runner_dict"] = {}

get_all_families_for_trigger_types is not needed anymore. the generator will work on the entire arch dict and do the splitting based on either the predefined groups and/or the archs manually requested

HereThereBeDragons · 2026-02-05T14:19:30Z

The blueprint provided is a good template, as the matrix is growing in complexity and prone to error from manual input. Will you be adding pytests verifying the contents of the matrix? I'd feel more comfortable with this PR landing if we had the tests.

yes the generator will have tests for it. see for current version ( #1732 ): https://github.com/ROCm/TheRock/blob/89c5e71a4ea42b537fe76572035ecd7f719a46b8/build_tools/github_actions/tests/configure_amdgpu_matrix_test.py

Another improvement that does not need to be in this pull request is some workflow and script combination that converts what is in the matrix into some .md file that has easier-to-read tables on what is being built and tested. Then, that workflow would run whenever the matrix gets changed to regenerate the .md file.

yeah a better overview would be nice. but definitely needs another PR for it. i also thought about splitting the file per group of families in case we get more archs.

jayhawk-commits · 2026-02-05T15:17:45Z

The blueprint provided is a good template, as the matrix is growing in complexity and prone to error from manual input. Will you be adding pytests verifying the contents of the matrix? I'd feel more comfortable with this PR landing if we had the tests.

yes the generator will have tests for it. see for current version ( #1732 ): https://github.com/ROCm/TheRock/blob/89c5e71a4ea42b537fe76572035ecd7f719a46b8/build_tools/github_actions/tests/configure_amdgpu_matrix_test.py

Another improvement that does not need to be in this pull request is some workflow and script combination that converts what is in the matrix into some .md file that has easier-to-read tables on what is being built and tested. Then, that workflow would run whenever the matrix gets changed to regenerate the .md file.

yeah a better overview would be nice. but definitely needs another PR for it. i also thought about splitting the file per group of families in case we get more archs.

Testing being done in the generator sounds good to me. Testing can add coverage for some of the nits from this PR like lower-case and ordering of items, which I don't have a strong opinion on.

HereThereBeDragons · 2026-02-05T18:09:10Z

i guess we can ignore those CI failures as the files added here are not used anywhere yet?

jayhawk-commits · 2026-02-05T18:23:35Z

i guess we can ignore those CI failures as the files added here are not used anywhere yet?

Agreed

ScottTodd · 2026-02-05T19:49:15Z

+Cmake targets are defined in: cmake/therock_amdgpu_targets.cmake
+"""
+
+# blueprint = {


IMO it's easier to copy/paste from an existing configuration. If the configurations were split between multiple files then having a clean reference/blueprint here would be more valuable.

ScottTodd · 2026-02-05T21:51:29Z

+amdgpu_family_predefined_groups = {
+    # The 'presubmit' matrix runs on 'pull_request' triggers (on all PRs).
+    "amdgpu_presubmit": ["gfx94Xxdcgpu", "gfx110x-all", "gfx1151", "gfx120x-all"],


typos here? - and consistent uppercase or lowercase of X?

Suggested change

amdgpu_family_predefined_groups = {

# The 'presubmit' matrix runs on 'pull_request' triggers (on all PRs).

"amdgpu_presubmit": ["gfx94Xxdcgpu", "gfx110x-all", "gfx1151", "gfx120x-all"],

amdgpu_family_predefined_groups = {

# The 'presubmit' matrix runs on 'pull_request' triggers (on all PRs).

"amdgpu_presubmit": ["gfx94X-dcgpu", "gfx110x-all", "gfx1151", "gfx120X-all"],

yeah... i had it capitalized and then let myself convince from geo that lower case is also not a bad choice. i switch it back to "X". matching to the cmake file is definitely what i had on my mind in the original design ... which i then forgot over the last couple of months :(

ScottTodd · 2026-02-05T21:54:07Z

+            },
+        }
+    },
+    "gfx115x": {


I'd go the other way and use uppercase consistently, to match the source of truth in https://github.com/ROCm/TheRock/blob/main/cmake/therock_amdgpu_targets.cmake

ScottTodd · 2026-02-05T21:57:27Z

+    "gfx103x": {
+        "dgpu": {


I don't understand the nesting here. Is this planning towards having e.g.

"gfx103x": { "all": { }, "dgpu": { }, "igpu": { } }

?

If we only ever have one at a time, it would be simpler to collapse down a level to just

"gfx103X-all": { }, "gfx103X-dgpu": { }, "gfx103X-igpu": { }

There there wouldn't an extra level of nesting between

"gfx115x": { "gfx1153": { // <--- single target "linux": { "build": { "expect_failure": True, "build_variants": ["release"], }, "gfx90x": { "dcgpu": { // <--- family grouping "linux": { "build": { "build_variants": ["release"],

Comments like this and my previous thought on converting the contents of the matrix in an easier to read format has me thinking that maybe it is worth exploring the reverse. We could define the matrix in format (.yml or ,json for example) that is easier to read and maintain, and then script to consume and test its contents for things we want to standardize like capitalization.

ScottTodd · 2026-02-05T22:06:20Z

+amdgpu_family_info_matrix_all = {
+    "gfx94x": {


I wonder if this would be easier to maintain and edit if we refactored like so:

amdgpu_family_gfx94X = { "dcgpu": { ... } } amdgpu_family_gfx110X = { "dcgpu": { ... } } amdgpu_family_info_matrix_all = { "gfx94X": amdgpu_family_gfx94X, "gfx110X": amdgpu_family_gfx110X, ... }

It would be possible to add a family and then forget to "register" it in the _all list though.

I had some similar thoughts back on #1097 and #1350 .

I don't understand the nesting here.

my idea of the nesting was that everything below "gfx103x" can either be group or arch that are part of "gfx103". just anything matching "gfx103" and adding on demand when needed for the CI.

if we would want to run "igpu", or just the single gpu target "gfx1030" or "gfx1032" we would have the structure like this:

"gfx103x": { "all": { }, "dgpu": { }, "igpu": { }, "gfx1030": { }. "gfx1032": { } }

thinking about it having separate files for better editing:
we could have

amdgpu_family_matrix.py amdgpu_family_gfx103x.py amdgpu_family_gfx950.py amdgpu_family_gfx94x.py ...

in which we do the same trick as we do for the pytorch skip testing:

all amdgpu_family_gfx<family>.py have the same dictionary name: amdgpu_family_info

amdgpu_family_matrix.py builds amdgpu_family_info_matrix_all by importing all amdgpu_family_gfx*.py files and adding the specific amdgpu_family_info to amdgpu_family_info_matrix_all

structure within amdgpu_family_gfx<family>.py would be:

amdgpu_family_info = { "gfx103x": { "all": { }, "dgpu": { }, "igpu": { }, "gfx1030": { }. "gfx1032": { } } }

maybe having a short call about it would be good?

just as an addition. this is the current plan in my head:

new structure for amdgpu matrix to know what belongs to build, test, release and be independent of if arch belongs to presubmit, postsubmit or nightly_ci

new configure_ci ( configure_amdgpu_matrix.py ) that allows for default values for each configuration step and removes entire matrices that are not required (e.g. no testing wanted = no test section in the matrix returned)

new setup.yml that propagates the container image to only have to set the image at a single place

add weekly ci

starting to move over the other cis to use the new layout (should be easy as the only the toplevel workflow elements need to be changed)

add multi arch generator in configure_amdgpu_matrix.py

looking then if and how the rest of the configurators (fetch_.. , configure_...,) etc can be integrated into configure_amdgpu_matrix.py

finish moving all cis to the new format

…ed in cmake/therock_amdgpu_targets.cmake

ScottTodd · 2026-02-12T17:48:13Z

+Cmake targets are defined in: cmake/therock_amdgpu_targets.cmake
+"""
+
+# blueprint = {


(So I would suggest deleting this comment)

ScottTodd · 2026-02-12T18:51:48Z

+                    },
+                    "sanity_check_only_for_family": True,
+                },
+                "release": {"push_on_success": True, "bypass_tests_for_releases": True},


nit: add a trailing comma after bypass_tests_for_releases so this wraps like the other entries

Suggested change

"release": {"push_on_success": True, "bypass_tests_for_releases": True},

"release": {

"push_on_success": True,

"bypass_tests_for_releases": True,

},

i blame pre commit hook :-P

HereThereBeDragons requested review from ScottTodd, amd-justchen, geomin12, jayhawk-commits, marbre and stellaraccident February 4, 2026 11:10

github-project-automation Bot added this to TheRock Triage Feb 4, 2026

github-project-automation Bot moved this to TODO in TheRock Triage Feb 4, 2026

HereThereBeDragons requested a review from amd-aakash February 4, 2026 13:43

HereThereBeDragons added 4 commits February 4, 2026 13:58

refactor and introduce a new amdgpu_family_matrix layout

a65c806

add description and rename test-runs-on-multi-gpu to test-multi-gpu f…

d2e3b1d

…or multi gpu runners.

add note about needing to edit both amdgpu_family files for the meant…

840c59f

…ime.

update new_amdgpu_family_matrix

614a797

HereThereBeDragons force-pushed the users/lpromber/refactor_amdgpu_matrix branch from b016770 to 614a797 Compare February 4, 2026 15:20

HereThereBeDragons added a commit that referenced this pull request Feb 4, 2026

get changes from users/lpromber/refactor_amdgpu_matrix #3248

17ea357

jayhawk-commits reviewed Feb 4, 2026

View reviewed changes

geomin12 reviewed Feb 4, 2026

View reviewed changes

change request: make note more verbose, make gpus all lower case

781770d

jayhawk-commits approved these changes Feb 5, 2026

View reviewed changes

ScottTodd reviewed Feb 5, 2026

View reviewed changes

fix type and change capitalization of gfx<213>X to match format as us…

64f39df

…ed in cmake/therock_amdgpu_targets.cmake

HereThereBeDragons requested a review from ScottTodd February 12, 2026 12:26

ScottTodd approved these changes Feb 12, 2026

View reviewed changes

change request

636b12d

HereThereBeDragons added a commit that referenced this pull request Feb 12, 2026

get changes from users/lpromber/refactor_amdgpu_matrix #3248

97b48a6

HereThereBeDragons merged commit fc87910 into main Feb 12, 2026
15 of 16 checks passed

HereThereBeDragons deleted the users/lpromber/refactor_amdgpu_matrix branch February 12, 2026 22:03

github-project-automation Bot moved this from TODO to Done in TheRock Triage Feb 12, 2026

HereThereBeDragons added a commit that referenced this pull request Feb 12, 2026

update to final version of new_amdgpu_matrix.py #3248

b48ee61

HereThereBeDragons added a commit that referenced this pull request Feb 12, 2026

update to final version of new_amdgpu_matrix.py #3248

0c1b080

ScottTodd mentioned this pull request Feb 18, 2026

Thread individual GPU targets through test fetch pipeline #3452

Merged

4 tasks

Conversation

HereThereBeDragons commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HereThereBeDragons commented Feb 4, 2026

Uh oh!

jayhawk-commits left a comment

Choose a reason for hiding this comment

Uh oh!

jayhawk-commits commented Feb 4, 2026

Uh oh!

geomin12 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HereThereBeDragons commented Feb 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HereThereBeDragons commented Feb 5, 2026

Uh oh!

jayhawk-commits commented Feb 5, 2026

Uh oh!

HereThereBeDragons commented Feb 5, 2026

Uh oh!

jayhawk-commits commented Feb 5, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HereThereBeDragons Feb 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HereThereBeDragons Feb 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

HereThereBeDragons commented Feb 4, 2026 •

edited

Loading

HereThereBeDragons commented Feb 5, 2026 •

edited

Loading

HereThereBeDragons Feb 6, 2026 •

edited

Loading

HereThereBeDragons Feb 6, 2026 •

edited

Loading